Skip to content

Conversation

@MaxKellermann
Copy link
Contributor

These patches optimize large reads by skipping the buffer when appropriate. This saves lots of read() system calls and reduces unnecessary memory copies.

@MaxKellermann
Copy link
Contributor Author

MaxKellermann commented Feb 4, 2022

Azure CI failures unrelated to this PR.

The other unit test failures are ... debatable.

ext/standard/tests/streams/stream_read_object_return.phpt fails because after my changes, two warnings get emitted by the unit test class

Warning: file_get_contents(): MyStream::stream_set_option is not implemented! in /home/max/svn/cmag/php/ext/standard/tests/streams/stream_read_object_return.php on line 16
Warning: file_get_contents(): MyStream::stream_eof is not implemented! Assuming EOF in /home/max/svn/cmag/php/ext/standard/tests/streams/stream_read_object_return.php on line 16
string(0) ""

The PHP core attempts to disable buffering (set_option) and has an additional "eof" check to avoid an unnecessary read() system call when we already know we're at end-of-file. That is the whole point of these patches, but this unit test makes very strong assumptions on how this is implemented, and I think the unit test should be changed to adapt to those implementation details. Other opinions?

About ext/standard/tests/streams/stream_set_chunk_size.phpt, it's before:

should elicit one read of size 100 (chunk size)
read with size: 100
int(100)

after:

should elicit one read of size 100 (chunk size)
read with size: 250
int(250)

With my change, PHP circumvents the read buffer if it is empty and a read was requested of more than the buffer size. Using the buffer in such a case, when reading a big portion of a file with one call, results in unnecessary copying memory around (from stream to read buffer, from read buffer to return value); the buffer also limits each read to a 8 kB chunk at most. This unit test expects PHP to not do that. Opinions here?

If desired, I can amend this PR to also include the necessary unit test changes.

I can also try to figure out how to do these optimizations only for "plain file" streams, where the benefit is very noticable.

@staabm
Copy link
Contributor

staabm commented Feb 4, 2022

do you have any numbers, of how much faster this should get with this change?

@MaxKellermann
Copy link
Contributor Author

MaxKellermann commented Feb 4, 2022

do you have any numbers, of how much faster this should get with this change?

Here you have excerpts from an strace which loads google_fonts.json (>700 kB) before this change:

1643919963.013260 open("google_fonts.json", O_RDONLY|O_LARGEFILE) = 8 <0.000015>
1643919963.013303 fstat(8, {st_mode=S_IFREG|0444, st_size=718571, ...}) = 0 <0.000012>
1643919963.013343 lseek(8, 0, SEEK_CUR) = 0 <0.000012>
1643919963.013381 fstat(8, {st_mode=S_IFREG|0444, st_size=718571, ...}) = 0 <0.000012>
1643919963.013424 read(8, "{\n  \"kind\": \"webfonts#webfontLis"..., 8192) = 8192 <0.000015>
1643919963.013471 read(8, "2/FeVfS0NQpLYgrjJbC5FxxbU.ttf\",\n"..., 8192) = 8192 <0.000015>
[... a total of 90 read() system calls ... ]
1643919963.017463 read(8, "   \"latin\"\n      ],\n      \"versi"..., 8192) = 5867 <0.000013>
1643919963.017505 read(8, "", 8192) = 0 <0.000011>
1643919963.017541 read(8, "", 8192) = 0 <0.000011>
1643919963.017577 close(8)        = 0 <0.000012>

This finishes after 4.3ms.
After this change:

1643924380.436340 open("google_fonts.json", O_RDONLY|O_LARGEFILE) = 8 <0.000011>
1643924380.436373 fstat(8, {st_mode=S_IFREG|0644, st_size=913689, ...}) = 0 <0.000010>
1643924380.436405 lseek(8, 0, SEEK_CUR) = 0 <0.000010>
1643924380.436435 fstat(8, {st_mode=S_IFREG|0644, st_size=913689, ...}) = 0 <0.000010>
1643924380.436467 read(8, "{\n  \"kind\": \"webfonts#webfontLis"..., 921881) = 913689 <0.000210>
1643924380.436700 read(8, "", 8192) = 0 <0.000010>
1643924380.436732 close(8)        = 0 <0.000011>

This is the full strace, not abbreviated, and it finishes after 0.39ms - that's a speedup factor of 10.

This is, of course, a somewhat degenerate microbenchmark, and the numbers are not exactly reliable (only one iteration and I have recorded only wallclock time, not CPU time), but I think it's good enough to demonstrate my point. I think it is obvious enough that reducing 90 read() system calls to just 2 is a huge improvement.

This was on a machine without Spectre/Meltdown mitigations. With those in place, the system call overhead will be even larger, increasing the gap between existing PHP code and my PR code.

@MaxKellermann
Copy link
Contributor Author

I did another benchmark, PHP source code:

<?php
for ($i = 1; $i <= 10000; $i++)
    file_get_contents('/bin/bash');

Before this PR:

          2,153.15 msec task-clock:u              #    1.000 CPUs utilized          
     2,863,801,211      cycles:u                  #    1.330 GHz                    
       271,987,578      instructions:u            #    0.09  insn per cycle         
        72,994,141      branches:u                #   33.901 M/sec                  
            85,598      branch-misses:u           #    0.12% of all branches        

    44.54%  php      libc-2.33.so      [.] __memmove_avx_unaligned_erms
    21.37%  php      [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
     4.40%  php      [kernel.vmlinux]  [k] filemap_get_read_batch
     3.65%  php      [kernel.vmlinux]  [k] filemap_read
     3.51%  php      [kernel.vmlinux]  [k] copy_page_to_iter
     1.47%  php      [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
     1.35%  php      php               [.] _php_stream_fill_read_buffer
     1.26%  php      [kernel.vmlinux]  [k] syscall_return_via_sysret
     1.03%  php      [kernel.vmlinux]  [k] sched_clock
     0.90%  php      [kernel.vmlinux]  [k] map_id_range_down
     0.85%  php      php               [.] _php_stream_read

After this PR:

          1,230.20 msec task-clock:u              #    0.999 CPUs utilized          
        50,945,195      cycles:u                  #    0.041 GHz                    
        43,816,369      instructions:u            #    0.86  insn per cycle         
         9,013,545      branches:u                #    7.327 M/sec                  
            91,760      branch-misses:u           #    1.02% of all branches        

    77.36%  php      [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
     5.84%  php      [kernel.vmlinux]  [k] filemap_read
     5.66%  php      [kernel.vmlinux]  [k] filemap_get_read_batch
     5.10%  php      [kernel.vmlinux]  [k] copy_page_to_iter
     0.64%  php      [kernel.vmlinux]  [k] folio_mark_accessed
     0.32%  php      [kernel.vmlinux]  [k] __cond_resched
     0.22%  php      [kernel.vmlinux]  [k] filemap_get_pages
     0.19%  php      [kernel.vmlinux]  [k] rcu_all_qs
     0.18%  php      [kernel.vmlinux]  [k] mark_page_accessed
     0.17%  php      [kernel.vmlinux]  [k] kmem_cache_alloc
     0.16%  php      [kernel.vmlinux]  [k] syscall_return_via_sysret
     0.14%  php      [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
     0.11%  php      php               [.] _efree

You can clearly see that the existing PHP code wastes most of the CPU time copying around data from the stream read buffer to the final string, and that disappears completely with my improvements. What remains is mostly copying the file data from the kernel's page cache to userspace.

@cmb69
Copy link
Member

cmb69 commented Feb 4, 2022

The failing stream_read_object_return.phpt exhibits an BC break. :(

@MaxKellermann
Copy link
Contributor Author

The failing stream_read_object_return.phpt exhibits an BC break. :(

What is a BC break? Is it caused by my change or just revealed by it?

@cmb69
Copy link
Member

cmb69 commented Feb 4, 2022

"BC break" == "backwards compatibility break". The mentioned test now fails, because unimplemented streamWrapper methods would now be called. That might be an issue (and might be fixable), or not. In any way, we're likely better off to turn the streamWrapper class into a set of interfaces in the long run.

@MaxKellermann
Copy link
Contributor Author

MaxKellermann commented Feb 4, 2022

"BC break" == "backwards compatibility break". The mentioned test now fails, because unimplemented streamWrapper methods would now be called.

I have added a php_stream_is(stream, PHP_STREAM_IS_STDIO) check to the first patch, so it only runs with regular files - which is a good enough optimization already, for now. We might extend the scope of this optimization later. What do you think?

About the stream_set_chunk_size.phpt failure (caused by the second patch): I think here, really the unit test is broken. I added another php_stream_eof() call to a function which already has one such call in a different code path (maxlen>0). So using this function should not break anything (which was not already broken before).

Existing (old) code:

php-src/main/streams/streams.c

Lines 1477 to 1480 in d67698a

if (maxlen > 0) {
result = zend_string_alloc(maxlen, persistent);
ptr = ZSTR_VAL(result);
while ((len < maxlen) && !php_stream_eof(src)) {

... and the unit test fails because it does not expect the eof method to be called.

@MaxKellermann
Copy link
Contributor Author

I added another patch which gives another 14% microbenchmark speedup by eliminating the pointless empty read() at the end, and by allocating the right buffer size from the beginning.


// TODO: Propagate error?
while ((ret = php_stream_read(src, ptr, max_len - len)) > 0){
while (!php_stream_eof(src) && (ret = php_stream_read(src, ptr, max_len - len)) > 0){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am not mistaken, php_stream_eof() also generates some system calls (poll(), recv() with MSG_PEEK flag).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only if the eof flag has not yet been set:

if (!stream->eof && PHP_STREAM_OPTION_RETURN_ERR ==
php_stream_set_option(stream, PHP_STREAM_OPTION_CHECK_LIVENESS,
0, NULL)) {

But you are right, PHP's "xp_streams" implementation is rather clumsy. The way it is implemented, my change adds one system call for streams (replacing recv() with poll()+recv()); it still eliminates one extra read() system call for regular files!
The xp_streams code could be optimized easily. Anybody interested?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two system calls, when it's not at the EOF, we got poll() + recv() + recv(), and CHECK_LIVENESS call usually returns true.
php_stream is not only designed for files, it also works for sockets.

The xp_streams code could be optimized easily. Anybody interested?

How? I can not understand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two system calls, when it's not at the EOF, we got poll() + recv() + recv()

Oh yes, you are right. I thought about the "eof" case, but forgot about the "not-eof".

How? I can not understand.

By omitting the poll() on operating systems where MSG_DONTWAIT is available.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always assume that doing so might cause a BC break.
In addition, this can also be applied when the stream is in non-blocking mode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always assume that doing so might cause a BC break.

How?

In addition, this can also be applied when the stream is in non-blocking mode.

You mean on operating systems which don't support MSG_DONTWAIT? (i.e. Windows)

@twose
Copy link
Member

twose commented Feb 9, 2022

This particular optimization does work for some particular cases, but it seems like we don't have benchmark scripts to show its impact on other common scripts.
To be honest, I don't like this complexity... And trying to implement zero-copy on such a generic abstract implementation always adds complexity and sacrifices maintainability.
At least, always call php_stream_eof() before php_stream_read() is definitely not a good idea for me here.

@MaxKellermann
Copy link
Contributor Author

This particular optimization does work for some particular cases, but it seems like we don't have benchmark scripts to show its impact on other common scripts.

My benchmark is number of system calls (which in the end, always boils down to faster execution). Of course, you are right, the number of system calls in other ways to use this API is of interest as well.

How can I contribute benchmarks to demonstrate the effectiveness of my optimizations? Are there existing examples of how these shall be integrated in php-src?

To be honest, I don't like this complexity... And trying to implement zero-copy on such a generic abstract implementation always adds complexity and sacrifices maintainability.

I don't like complexity either, but adding some slight well-reviewed complexity at a central place like the PHP language implementation to get performance advantages is well worth it. That's why PHP has complexity monsters like a JIT these days - the horrendous complexity pays off. My optimizations add only very little complexity, with big benefits.

At least, always call php_stream_eof() before php_stream_read() is definitely not a good idea for me here.

Maybe I should attack this from a different angle: the stream implementation for regular files should avoid the read() completely when it already knows (from the current file offset) that it's at EOF. What do you think?

@staabm
Copy link
Contributor

staabm commented Feb 9, 2022

would it make sense / be possible to separate the optimizatoin into a path which is only taken for local files and leave the path for non-file types as is?

@MaxKellermann
Copy link
Contributor Author

At least, always call php_stream_eof() before php_stream_read() is definitely not a good idea for me here.

@twose, then what do you think of this piece of code?

php-src/main/streams/streams.c

Lines 1480 to 1481 in 05023a2

while ((len < maxlen) && !php_stream_eof(src)) {
ret = php_stream_read(src, ptr, maxlen - len);

This is current git master. My patch just copied what was there in the same function, in a different code path.

@MaxKellermann
Copy link
Contributor Author

would it make sense / be possible to separate the optimizatoin into a path which is only taken for local files and leave the path for non-file types as is?

Possible, yes. I could easily add another php_stream_is(stream, PHP_STREAM_IS_STDIO) check. Make sense? - that's your choice, I'm newbie to PHP development. If you want it, I'll amend the php_stream_eof() patch.

Though - as I just replied to @twose - my php_stream_eof() call is not new, the same is being done in a different code path in the same function (if a maxlen parameter was given by the caller). Shall I add the same condition here as well? For me, it doesn't make sense to allow or disallow the "eof" call depending on whether the caller has given a maxlen. What is your opinion?

@MaxKellermann
Copy link
Contributor Author

At least, always call php_stream_eof() before php_stream_read() is definitely not a good idea for me here.

@twose, more of the same from git master:

while(!php_stream_eof(dba->fp)) {
/* read in the length of the key name */
if (!php_stream_gets(dba->fp, buf, 15)) {

while (!php_stream_eof(dba->fp)) {
if (!php_stream_gets(dba->fp, buf, 15)) {

while(!php_stream_eof(dba->fp)) {
if (!php_stream_gets(dba->fp, buf, 15)) {

while (!php_stream_eof(instream) && (ch = php_stream_getc(instream))!=EOF) {

while (!php_stream_eof(instream) && (ch = php_stream_getc(instream))!=EOF) {

while (!php_stream_eof(ftp->stream) && (ch = php_stream_getc(ftp->stream)) != EOF) {

php-src/ext/phar/phar.c

Lines 1652 to 1653 in 05023a2

while(!php_stream_eof(fp)) {
if ((got = php_stream_read(fp, buffer+tokenlen, readsize)) < (size_t) tokenlen) {

while (md->ulc || (!php_stream_eof(md->stream) && (ch = php_stream_getc(md->stream)))) {

while (!php_stream_eof(s)) {
char buf[SAPI_POST_HANDLER_BUFSIZ] = {0};
ssize_t len = php_stream_read(s, buf, SAPI_POST_HANDLER_BUFSIZ);

The php_stream_eof() + php_stream_read() appears to be a very common thing in the PHP code base. And many of these examples are unnecessary bloat.

@twose
Copy link
Member

twose commented Feb 9, 2022

@MaxKellermann

Yes. php_stream_eof() is everywhere. For example, phpredis will check EOF before every request and reconnect automatically if the connection is down, instead of reconnecting after request failures. Be sure, to use php_stream_eof is indeed the easiest way to implement such features and it works well a lot of the time, but I really don't like this way very much, it hurts performance, and it's not absolutely reliable, imagine that the socket connection might break after check_liveness immediately. Many people rely on this behavior without dealing with errors that may occur. (BTW, MySQL client cannot go the same way because it cannot restore the context after re-connect).

And, in my opinion, php_stream doesn't have a very strong focus on performance. And you're right, it does have a lot to optimize for. But, I'm also not sure if it would cause any other problems (e.g. BC break) if we changed the code. So I always tend to focus on new code instead of the code that already exists. It's too difficult for me to find the reasons for the existence of a historical code...

I'm just focused on network programming but I'm not familiar enough with php_stream implementation...
And I'm just expressing my concerns and want to get your attention, but I don't actually have more specific ideas :)

@MaxKellermann
Copy link
Contributor Author

And, in my opinion, php_stream doesn't have a very strong focus on performance.

It's used everywhere both by PHP itself and by PHP applications, and its unnecessary overhead is very noticable. On my setup, a bunch of micro-improvements made a huge difference.

We agree very much that php_stream_eof() should be avoided, because it adds overhead and usually doesn't solve the problem (due to TOCTOU) - after all, I'm here only to help get rid of unnecessaty overhead. Just in my particular patch here, it doesn't make a practical difference, because the precent for using it in this very function is already there.
My suggestion is to not make this aspect a showstopper for merging now, and after merging, talk about how to get rid of unnecessary php_stream_eof() calls, including the one that exists already in _php_stream_copy_to_mem(). I'm waiting for the PHP maintainers to decide.

@MaxKellermann
Copy link
Contributor Author

@twose I added a patch which optimizes php_sockop_set_option(), optimizing away unnecessary poll() system calls using MSG_DONTWAIT.

The read buffer is useless here, it only hurts performance.
This eliminates the last redundant read() system call in every
file_get_contents() call.
If the read buffer is currently empty, don't buffer large reads
(larger than chunk_size), because buffering would only hurt
performance in such an edge case.
If we know the exact file size already, we can allocate the right
buffer size and issue only one read() system call.  This eliminates
unnecessary buffer allocations (file size plus 8 kB), eliminates
buffer reallocations via zend_string_truncate(), and eliminates the
last read() system call.

Benchmark:

 echo Hello World >/tmp/hello.txt
 <?php for ($i = 1; $i <= 1000000; $i++) file_get_contents('/tmp/hello.txt');

Previously:

 openat(AT_FDCWD, "/tmp/hello.txt", O_RDONLY) = 3
 newfstatat(3, "", {st_mode=S_IFREG|0600, st_size=12, ...}, AT_EMPTY_PATH) = 0
 lseek(3, 0, SEEK_CUR)                   = 0
 newfstatat(3, "", {st_mode=S_IFREG|0600, st_size=12, ...}, AT_EMPTY_PATH) = 0
 read(3, "Hello World\n", 8204)          = 12
 read(3, "", 8192)                       = 0
 close(3)                                = 0

          4,036.55 msec task-clock:u              #    0.995 CPUs utilized
     4,683,289,286      cycles:u                  #    1.160 GHz
     7,954,269,313      instructions:u            #    1.70  insn per cycle
     1,165,461,174      branches:u                #  288.727 M/sec
         1,431,759      branch-misses:u           #    0.12% of all branches

Optimized:

 openat(AT_FDCWD, "/tmp/hello.txt", O_RDONLY) = 3
 newfstatat(3, "", {st_mode=S_IFREG|0600, st_size=12, ...}, AT_EMPTY_PATH) = 0
 lseek(3, 0, SEEK_CUR)                   = 0
 newfstatat(3, "", {st_mode=S_IFREG|0600, st_size=12, ...}, AT_EMPTY_PATH) = 0
 read(3, "Hello World\n", 12)            = 12
 close(3)                                = 0

          3,473.51 msec task-clock:u              #    0.997 CPUs utilized
     4,132,407,666      cycles:u                  #    1.190 GHz
     6,841,273,246      instructions:u            #    1.66  insn per cycle
     1,024,464,752      branches:u                #  294.936 M/sec
         1,478,403      branch-misses:u           #    0.14% of all branches

For this micro-benchmark, the improvement is 14%.

This patch needs to adjust the unit test "bug54946.phpt" because the
EBADF warning is no longer logged.
@MaxKellermann
Copy link
Contributor Author

I have rebased this PR branch, but of course, the unit test failures are still there (they fail because they rely on implementation details).
What is the PHP team's opinion on these optimizations? Is there interest in merging them? If yes, let's talk about how to fix the unit tests.
If there is no interest (or no feedback), I'll close this PR. It's fine for me to not share my work on improving PHP I/O performance, and have them only on our own servers.

@MaxKellermann MaxKellermann deleted the read-optimizations branch March 28, 2022 03:51
@iluuu1994
Copy link
Member

I don't necessarily think that there isn't an interest to merge this but rather that nobody with the required knowledge has had time to review it yet.

@MaxKellermann
Copy link
Contributor Author

If there's interest, contact me and I'll resubmit this PR. Meanwhile, you can check #8092 which is a related micro-optimization, but doesn't break any (fragile) unit test.


/* disabling the read buffer allows doing the whole transfer
in just one read() system call */
if (php_stream_is(stream, PHP_STREAM_IS_STDIO))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaxKellermann I don't quite understand why this pull request got closed in it's entirety. I know there is an issue around commits 2+3, but commit 1 (file_get_contents without read buffer) is a clear, simple and non-breaking and uncontroversial improvement IMO. Maybe commit #e45f829 can be extracted into a new pull request to get merged?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK here you are #8547

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants